The Annals of Applied Statistics 
2010, Vol. 4, No. 4, 2114-2125 
DOI: 10.1214/10-AOAS347 

© Institute of Mathematical Statistics, 2010



DETECTION OF TREATMENT EFFECTS BY 
COVARIATE-ADJUSTED EXPECTED SHORTFALL

By Xuming He, Ya-Hui Hsu and Mingxiu Hu 

University of Illinois at Urbana-Champaign, University of Illinois at
Urbana-Champaign and Millennium Pharmaceuticals, Inc.

The statistical tests that are commonly used for detecting mean 
or median treatment effects suffer from low power when the two dis- 
tribution functions differ only in the upper (or lower) tail, as in the 
assessment of the Total Sharp Score (TSS) under different treatments 
for rheumatoid arthritis. In this article, we propose a more powerful 
test that detects treatment effects through the expected shortfalls. 
We show how the expected shortfall can be adjusted for covariates, 
and demonstrate that the proposed test can achieve a substantial 
sample size reduction over the conventional tests on the mean effects. 

1. Introduction. We consider the problem of testing the hypothesis of 
no treatment effect against a class of alternatives where the two outcome 
distributions differ only or mainly in the right tail. As demonstrated in 
some recent trials of rheumatoid arthritis therapies in van der Heijde et al. 
(2006) and Kremer et al. (2006), the changes in Total Sharp Scores, the 
primary measurements of the treatment effects on prevention of structural 
damage, are nearly identical for most therapies for nearly 75% of the patient 
population, but the difference lies in the most challenging 25% of the patient 
population where a less effective treatment loses its efficacy, resulting in a 
heavy right tail in its outcome distribution. The two-sample t-test or its 
regression counterpart in covariate-adjusted linear models is commonly used 
for detecting the treatment effects, but due to skewness and heavy tails of the
distributions, the test does not have satisfactory power. Nonparametric tests 
on the median differences, for example, would fare even worse in such cases, 
because the median differences are often negligible among those therapies. 

A natural approach in this type of application is to focus on the average in
one tail, or the expected tail loss (also known as the expected shortfall). In
finance, this is often referred to as the conditional value at risk (CVaR), for
measuring the risk of a portfolio. In our context, a treatment is said to be
more effective if it has a smaller expected shortfall, where the expected
shortfall is defined to be the conditional mean of the outcome (e.g., change in
Total Sharp Score) above the $\tau$th quantile. In this paper, $\tau$ will be
taken to be a user-specified value (e.g., 0.75), but a good choice of $\tau$
clearly depends on the area of application. In finance, the most relevant
choices of $\tau$ fall above 0.90.
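
As an illustration of this definition, the following minimal sketch computes an empirical expected shortfall as the mean of the observations above the sample quantile. The function names, the quantile convention (an order statistic) and the two small samples are assumptions made for the example; they are not data or code from the study.

```python
import math

def empirical_quantile(values, tau):
    """Smallest order statistic with at least a fraction tau of the data at or below it."""
    ordered = sorted(values)
    k = max(1, math.ceil(tau * len(ordered)))
    return ordered[k - 1]

def expected_shortfall(values, tau):
    """Mean of the observations strictly above the empirical tau-th quantile."""
    q = empirical_quantile(values, tau)
    tail = [v for v in values if v > q]
    if not tail:
        raise ValueError("no observations above the tau-th quantile")
    return sum(tail) / len(tail)

# Hypothetical outcomes: the two samples agree except in the upper quarter.
treatment = [0, 0, 0, 1, 1, 2, 3, 4]
control = [0, 0, 0, 1, 1, 2, 9, 14]
es_treatment = expected_shortfall(treatment, 0.75)
es_control = expected_shortfall(control, 0.75)
print(es_treatment, es_control)
```

In this toy example the two samples share the same lower three quarters, so their medians coincide, while the expected shortfalls at level 0.75 differ clearly.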

A two-sample comparison of the expected shortfalls is not difficult, as it 
falls into the well-known theory of the L-statistic. In fact, there are also a 
large number of other tests that one can use to compare tails of two outcome 
distributions, but few have been developed to adjust for covariates. The 
purpose of this paper is to develop a simple test for testing the hypothesis 
on the treatment effect adjusting for certain covariates; the proposed test
uses the COVariate-adjusted Expected Shortfall (COVES).

Our work starts with a brief introduction to our motivating example on 
the TSS for rheumatoid arthritis therapies in Section 2. In Section 3, we 
propose an appropriate treatment effect size of covariate-adjusted expected 
shortfall, followed by a new test for detecting differences in the treatment 
effects. The large sample theory for the proposed test is given here. In Sec- 
tion 4, we compare the proposed COVES test with the t-test based on the 
least squares regression in empirical power. In particular, we show that when 
the outcome distributions resemble those of the TSS, the COVES test has a 
clear advantage in reducing sample sizes in clinical trials. The basic idea and 
methodology developed in this paper apply to other problems of comparing 
two covariate-adjusted tails of outcome distributions. In Section 5, we
provide a diagnostic tool that can be used to gauge the need for the proposed
test and to guide the selection of $\tau$. Section 6 concludes the paper with some
additional remarks about the COVES test. 

2. A primer on total sharp scores. Rheumatoid arthritis (RA) is a chronic 
disabling disease that causes destruction of joint cartilage and erosion of ad- 
jacent bones. In RA clinical trials, TSS is used to measure the treatment 
effect of RA drugs on prevention of structural damage to the joints. It con- 
sists of two components, erosion score and score for joint space narrowing 
(JSN), which are obtained through examination of hand and/or feet joints 
with radiographic methods. The first description of TSS is given by Sharp 
et al. (1971), but TSS has been modified in later studies. The example pre- 
sented in this paper is based on van der Heijde's modification of TSS scoring 
system [van der Heijde (2000)], which is based on examination of 16 areas 
for erosions and 15 for joint space narrowing in each hand. The erosion score
per joint ranges from 0 to 5, with 0 representing a normal condition and 5
the most severe disease, and thus the total erosion score ranges from 0 to
160 (16 areas by 2 hands by 5). The JSN score ranges from 0 to 4 per joint,
with a higher score representing more severe disease, which leads to a range of
0 to 120 (15 areas by 2 hands by 4) for the total JSN score. Therefore, the
range of TSS is 0-280. The primary interest is the change from baseline in
TSS in one or two years.

Fig. 1. This figure, reproduced from van der Heijde et al. (2006), shows that the changes
in TSS in the TEMPO trials differ mostly in the upper tails.

The change in TSS has a highly skewed distribution under any known 
treatment. In the TEMPO trial [van der Heijde et al. (2006)] comparing 
Methotrexate, Etanercept, and the combination therapy of Etanercept and 
Methotrexate, the three treatments are similarly effective for about 75% of 
the patients whose conditions improved or showed no or little progression 
from the baseline; see Figure 1. Medians for all three groups are around 0. 
Treatment differences come from the 25% of the patients with the most pro- 
gressive diseases. In other words, the differences in treatment effects are not 
attributed to a location-scale change in the distributions. The distributions 
of clinical data from several other major RA trials [Kremer et al. (2006); 
Keystone et al. (2004); Lipsky et al. (2000)] showed similar characteristics. 

It is clear that the distributions for the changes in TSS are far from
normal, and the t-test is expected to lose power due to the skewness and
heavier tails that are evident in the data. Nonparametric tests on the median
differences would fare even worse, because the median differences of those
treatments are essentially nonexistent. Researchers in some trials have
considered chi-square tests on the proportion of patients with little disease
progression by dichotomizing TSS, but there has been no agreed-upon cutoff
point for dichotomization. In fact, the power of the chi-square test depends
rather critically on the cutoff point. In addition, it is difficult to perform
the chi-square test when a covariate needs to be adjusted for. A natural
quantity for distinguishing treatment effects is the expected shortfall, which
averages the changes in TSS in the upper tail. We propose to use the regression
quantile approach of Koenker and Bassett (1978) to estimate the
covariate-adjusted expected shortfall.

Later in this paper, we use a recent observational study conducted at 
Brigham and Women's Hospital and sponsored by Millennium Pharmaceuticals
Inc. and Biogen Idec as a basis for assessing the performance of the
proposed test. We take 150 subjects in the study, who are under active treat- 
ment, and simulate a control group whose outcome distribution is chosen to 
mimic the treatment difference reported in other trials. For example, in the 
Adalimumab trial [Keystone et al. (2004)], the variance of the treatment 
group (using the drug Adalimumab 20 mg/kg) is about half of that in the 
control group (using the drug Methotrexate) with a mean difference of -1.9.
In the Abatacept trial [Kremer et al. (2006)], the variance in the Abatacept 
group is about one third of that in the control group. In our simulation stud- 
ies, we use the ratio of variances between 2:1 and 3:1 between two treatment 
groups. 

3. Proposed test: COVES. We use a dummy variable D as treatment 
indicator, C as the covariate of interest, and Z as the outcome measure. 
For simplicity of notation, we consider $C \in \mathbb{R}$ as a univariate
covariate and D taking values 0 or 1, but the work generalizes readily for
multivariate covariates and multiple treatments. As appropriate with randomized
trials, we assume that C and D are independent. We model the $\tau$th quantile
of Z given (D, C) as

(1)    $Q_Z(\tau \mid D, C) = \alpha(\tau) + \delta(\tau) D + \gamma(\tau) C,$

where the coefficients $\alpha$, $\delta$, and $\gamma$ are $\tau$-specific. In
this paper, we use $\tau = 0.75$ for empirical studies, but refer to Section 5
for guidance on the selection of $\tau$.
We also refer to Koenker (2005) for details on the linear regression quantile 
specification. 
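
To make the regression quantile idea concrete, the sketch below uses the check loss of Koenker and Bassett (1978) in the simplest, intercept-only case, where minimizing the summed loss over a constant recovers a sample quantile. The full model (1) minimizes the same loss over the intercept, treatment and covariate coefficients jointly, which is what a quantile regression routine does; the helper names and the toy sample here are assumptions for illustration only.

```python
def check_loss(u, tau):
    """Check function rho_tau(u) = u * (tau - I(u < 0))."""
    return u * (tau - (1.0 if u < 0 else 0.0))

def total_loss(values, q, tau):
    """Summed check loss when every observation is predicted by the constant q."""
    return sum(check_loss(v - q, tau) for v in values)

def best_constant(values, tau):
    """Minimize the summed check loss over candidate constants taken from the data."""
    candidates = sorted(set(values))
    return min(candidates, key=lambda q: total_loss(values, q, tau))

# Hypothetical sample of size 7, chosen so that the minimizer is unique.
sample = [0, 0, 1, 1, 2, 3, 9]
q50 = best_constant(sample, 0.5)
q75 = best_constant(sample, 0.75)
print(q50, q75)
```

Restricting the search to observed values is enough here because the summed check loss is piecewise linear in the constant, with kinks only at data points.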

Given data $(Z_i, D_i, C_i)$ with $D_i = 1$ for $i = 1, \ldots, m$ and
$D_i = 0$ for $i = m + 1, \ldots, m + n$, we can use the quantreg package in R
to obtain the regression quantile coefficient estimates $\hat\alpha$,
$\hat\delta$, and $\hat\gamma$. Then, let
$\hat e_i = Z_i - \hat\alpha - \hat\delta D_i - \hat\gamma C_i$ be the
residuals from the $\tau$th regression quantile. By contrast, we also write
$e_i = Z_i - \alpha(\tau) - \delta(\tau) D_i - \gamma(\tau) C_i$, which has
zero as the $\tau$th conditional quantile given $(D_i, C_i)$ due to (1).

Let $Y_i = Z_i - \hat\gamma C_i$ be the covariate-adjusted outcome, and define
the empirical covariate-adjusted expected shortfall for the two groups as

       $\mathrm{COVES}_\tau(d) = \sum_{D_i = d} w_i^{(d)} Y_i, \quad d = 0, 1,$

where $w_i^{(d)} = S_d^{-1} I(\hat e_i > 0)$ and
$S_d = \sum_{D_i = d} I(\hat e_i > 0)$. The quantity $\mathrm{COVES}_\tau(d)$
is the average of the outcomes for group d that are above the $\tau$th
covariate-adjusted quantile.

The proposed COVES test statistic for the hypothesis of no difference
between the two treatment groups is given as

(2)    $T_\tau(m,n) = \mathrm{COVES}_\tau(1) - \mathrm{COVES}_\tau(0).$
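
The two definitions above can be sketched in a few lines once the regression quantile coefficients are available. In the sketch below the coefficient values are placeholders passed in by hand (in practice they would come from a quantile regression fit such as the one mentioned above), and the function name and toy data are hypothetical.

```python
def coves_statistic(z, d, c, alpha_hat, delta_hat, gamma_hat):
    """Return (COVES(1), COVES(0), T) given data and fitted regression quantile coefficients."""
    group_means = {}
    for group in (0, 1):
        tail = []
        for zi, di, ci in zip(z, d, c):
            if di != group:
                continue
            residual = zi - alpha_hat - delta_hat * di - gamma_hat * ci
            if residual > 0:
                # Covariate-adjusted outcome Y_i = Z_i - gamma_hat * C_i
                tail.append(zi - gamma_hat * ci)
        if not tail:
            raise ValueError("no positive residuals in group %d" % group)
        group_means[group] = sum(tail) / len(tail)
    return group_means[1], group_means[0], group_means[1] - group_means[0]

# Toy data; the coefficients below are placeholders, not an actual fit.
c = [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0]
d = [1, 1, 1, 1, 0, 0, 0, 0]
z = [0.0, 2.0, 5.0, 4.0, 1.0, 1.0, 9.0, 10.0]
coves1, coves0, t_stat = coves_statistic(z, d, c, alpha_hat=0.0, delta_hat=0.0, gamma_hat=1.0)
print(coves1, coves0, t_stat)
```

A negative value of the statistic indicates a smaller covariate-adjusted expected shortfall in the treatment group (d = 1) than in the control group.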

Let $\bar C_\tau(d)$ and $\bar e_\tau(d)$ be the averages of $C_i$ and $e_i$
in group d that are above the $\tau$th regression quantile, that is,

       $\bar C_\tau(d) = S_d^{-1} \sum_{D_i = d} C_i I(\hat e_i > 0),$

       $\bar e_\tau(d) = S_d^{-1} \sum_{D_i = d} (Z_i - \alpha(\tau) - \delta(\tau) D_i - \gamma(\tau) C_i) I(\hat e_i > 0).$

Then, the test statistic (2) can be written as

(3)    $T_\tau(m,n) = \delta(\tau) - (\hat\gamma - \gamma(\tau))(\bar C_\tau(1) - \bar C_\tau(0)) + (\bar e_\tau(1) - \bar e_\tau(0)),$

which makes it relatively easy to establish the asymptotic normality of the
test statistic as $m, n \to \infty$.
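
A short derivation shows how (3) follows from (2). It uses only model (1), the definition $e_i = Z_i - \alpha(\tau) - \delta(\tau) D_i - \gamma(\tau) C_i$, the covariate-adjusted outcome $Y_i = Z_i - \hat\gamma C_i$, and the fact that the weights $S_d^{-1} I(\hat e_i > 0)$ sum to one within each group; the hats denote the fitted regression quantile quantities.

```latex
\begin{align*}
Y_i &= Z_i - \hat\gamma C_i
     = \alpha(\tau) + \delta(\tau) D_i - (\hat\gamma - \gamma(\tau)) C_i + e_i, \\
\mathrm{COVES}_\tau(d) &= S_d^{-1} \sum_{D_i = d} Y_i \, I(\hat e_i > 0)
     = \alpha(\tau) + \delta(\tau)\, d - (\hat\gamma - \gamma(\tau)) \bar C_\tau(d) + \bar e_\tau(d), \\
T_\tau(m,n) &= \mathrm{COVES}_\tau(1) - \mathrm{COVES}_\tau(0)
     = \delta(\tau) - (\hat\gamma - \gamma(\tau)) (\bar C_\tau(1) - \bar C_\tau(0))
       + (\bar e_\tau(1) - \bar e_\tau(0)).
\end{align*}
```

The intercept $\alpha(\tau)$ cancels in the difference, which is why only the treatment coefficient, the estimation error in the covariate coefficient, and the tail averages of the errors remain.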

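To estimate the variance of $T_\tau(m,n)$, let $N_d = \sum_i I(D_i = d)$, $f_i$
be the conditional density function of $e_i$ given $(D_i, C_i)$ evaluated at 0,
and

       $C_i^* = C_i - N_d^{-1} \sum_j C_j I(D_j = d)$ for $D_i = d$,

as the orthogonal components of C relative to the treatment groups. In more
general problems, we can obtain $C^*$ by the Gram-Schmidt orthogonalization
of the design matrix. Furthermore, let

       $V_d = \sum_{D_i = d} \hat e_i^2 I(\hat e_i > 0) - N_d^{-1} \Bigl[ \sum_{D_i = d} \hat e_i I(\hat e_i > 0) \Bigr]^2$

and

(4)    $s_{m,n}^2 = (1 - \tau)^{-2} (V_1/m^2 + V_0/n^2) + \tau(1 - \tau)(\bar C_\tau(1) - \bar C_\tau(0))^2 U_f^{-2} \Bigl( \sum_i C_i^{*2} \Bigr),$

where $U_f = \sum_i f_i C_i^{*2}$.
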

Theorem 3.1. Suppose that $\lim_{m,n \to \infty} (m + n)^{-1} U_f$ exists,
$E|C_i|^3 < \infty$, and $f_i$ are uniformly bounded away from 0 and infinity.
Under the null hypothesis that $F_{Z \mid C, D=1} = F_{Z \mid C, D=0}$, we have

       $T_\tau(m,n) / s_{m,n} \to N(0, 1)$ as $m, n \to \infty$.

The proof of Theorem 3.1 is given in the Appendix, but to use the asymptotic
normality for testing the null hypothesis of no treatment effects, we
need a consistent estimate of $U_f$. If $e_i$ in each group (corresponding to
$D_i = 0$ or 1) follows a common distribution, then a kernel density estimate
can be used to estimate the common density at 0 from $\hat e_i$ in the dth
group. If the conditional densities vary with $C_i$, it is not possible to
estimate each $f_i$ consistently, but $U_f$, a linear combination of the
$f_i$'s, can still be consistently estimated; see He, Fung and Zhu (2002) and
Koenker (2005) for more details. For the empirical investigations in this
paper, the proposed test is carried out using a kernel density estimate,
density, in R on each treatment group.
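
For readers who want to see what such a density estimate at zero involves, here is a minimal Gaussian-kernel sketch. The rule-of-thumb bandwidth, the function name and the synthetic residuals are assumptions of this example and are not claimed to reproduce the settings used in the paper.

```python
import math
import statistics

def kernel_density_at(point, sample, bandwidth=None):
    """Gaussian kernel density estimate of the sample's density at `point`."""
    n = len(sample)
    if bandwidth is None:
        # Rule-of-thumb bandwidth (an assumption of this sketch).
        sd = statistics.stdev(sample)
        q1, _, q3 = statistics.quantiles(sample, n=4)
        iqr_scale = (q3 - q1) / 1.34
        spread = min(sd, iqr_scale) if iqr_scale > 0 else sd
        bandwidth = 0.9 * spread * n ** (-0.2)
    kernel_sum = sum(math.exp(-0.5 * ((point - x) / bandwidth) ** 2) for x in sample)
    return kernel_sum / (n * bandwidth * math.sqrt(2.0 * math.pi))

# Deterministic stand-in for residuals: evenly spaced standard normal quantiles.
normal = statistics.NormalDist()
residuals = [normal.inv_cdf((i + 0.5) / 1000) for i in range(1000)]
f_hat_zero = kernel_density_at(0.0, residuals)
print(round(f_hat_zero, 3))
```

The estimate falls slightly below the true standard normal density at zero (about 0.399) because kernel smoothing flattens the peak; with real residuals, one such estimate would be computed within each treatment group.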

4. Empirical investigations. In this section, we report some empirical 
power studies of the proposed test based on Monte Carlo simulations. The 
first study is constructed based on the data we obtained from a recent study 
on an undisclosed therapy to treat RA at the Brigham and Women's Hospi- 
tal in Boston. The other studies are constructed with other types of distri- 
butions in mind. Together, we find that the proposed COVES test greatly 
outperforms the usual regression tests on the mean differences when the 
group differences occur at one tail of the distributions. 

4.1. Targeted study on TSS. We use the empirical distribution, F, of
the TSS changes of 150 patients in the Brigham and Women's Hospital study
as the underlying distribution for the group d = 1. We take the baseline TSS
as the covariate in the analysis, whose empirical distribution for the group 
d = 1 will be denoted as G. 

The data from the control group (with d = 0) will be simulated as 

       $C = G^{-1}(u), \quad Z = F^{-1}(u) + 8|u - 0.65|^{1/4} I(u > 0.65),$

where u is a uniform random number in (0, 1). Clearly, the control group 
has a heavier right tail in its outcome, but the covariate C has the same 
distribution in both groups. In this setting, the variance of the control group 
is about twice that of the treatment group. Table 1 and Figure 2 summarize 
the differences of the two groups. 
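
The construction of the control outcomes can be sketched as follows. The study data are not reproduced here, so the sample standing in for F is hypothetical, as are the function names and the seed; only the shift term is taken from the formula above.

```python
import math
import random

def tail_shift(u):
    """Shift 8 * |u - 0.65| ** (1/4) * I(u > 0.65) added to the control outcome."""
    return 8.0 * abs(u - 0.65) ** 0.25 if u > 0.65 else 0.0

def inverse_ecdf(ordered, u):
    """Left-continuous inverse of the empirical distribution of a sorted sample."""
    k = max(1, math.ceil(u * len(ordered)))
    return ordered[k - 1]

def simulate_control_outcomes(treatment_sample, size, rng):
    """Draw Z = F^{-1}(u) + tail_shift(u) with u uniform on (0, 1)."""
    ordered = sorted(treatment_sample)
    draws = []
    for _ in range(size):
        u = rng.random()
        draws.append(inverse_ecdf(ordered, u) + tail_shift(u))
    return draws

# Hypothetical stand-in for the 150 observed TSS changes.
stand_in = [0.0] * 100 + [0.5] * 30 + [1.0, 2.0, 3.0, 5.0] * 5
control = simulate_control_outcomes(stand_in, 200, random.Random(1))

# Mean of the shift over u by a midpoint rule; it does not depend on F.
grid = 20000
mean_shift = sum(tail_shift((i + 0.5) / grid) for i in range(grid)) / grid
print(round(mean_shift, 3))
```

Whatever F is, the shift has mean $6.4 \times 0.35^{1.25} \approx 1.72$, which is in line with the mean difference of 1.74 reported in Table 1 for the empirical distributions.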



Table 1

Differences in the $\tau$th quantiles and in the mean, with the last column as the ratio of the
variances between the control group (d = 0) and the treatment group (d = 1)

$\tau$        0.5    0.6    0.7    0.75   0.8    0.9    0.99   Mean   Variance ratio
Difference    0      0      3.72   4.53   4.96   5.64   6.02   1.74   2.03

Fig. 2. Quantile function of the TSS change shows that the groups differ mostly in the
upper tails.

Fig. 3. Statistical powers of three tests in the targeted study on TSS as functions of
sample size m = n. The ES test ignores the covariate in the model.

The power functions for the COVES test with $\tau = 0.75$ and the t-test from
linear regression are shown in Figure 3 with sample sizes up to m = n = 350.
For comparison, we also include in the figure the power curve for the test
based on expected shortfalls (ES) without adjusting for the baseline TSS.
Table 2 provides the sample sizes needed to reach a power of 0.90 in clinical
trials with m = n as well as m = 2n. It is common in clinical experiments to
allocate twice as many patients to the treatment group when the treatment
is believed to be effective. In this study, the baseline TSS does not play a
significant role, so adjusting for the covariate in the analysis brings no
gain in statistical power for detecting the treatment effect. However, the
results show that the COVES test clearly outperforms the t-test, and the
latter would require a trial that is more than double in size.

Table 2

Sample sizes needed to reach power 0.9. The cases
of m = n and m = 2n are included

Test                           Sample size (m, n)
COVES test ($\tau$ = 0.75)     (120, 120) or (172, 86)
t-test                         (306, 306) or (450, 225)

Table 3

Difference of the two groups at $\eta = 1.35$, with the last column for
the ratio of error variances

$\tau$        0.5    0.6    0.7    0.75   0.8    0.9    Mean   Var ratio
Difference    0      0.34   0.70   0.91   1.13   1.72   0.54   2.97
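
The claim about trial size can be checked with a little arithmetic on the sample sizes reported in Table 2 for the targeted TSS study. The dictionary layout below is just a convenient way to hold those published numbers.

```python
# (m, n) needed to reach power 0.9, as reported in Table 2.
coves_sizes = {"m = n": (120, 120), "m = 2n": (172, 86)}
t_test_sizes = {"m = n": (306, 306), "m = 2n": (450, 225)}

# Ratio of total trial sizes, t-test relative to COVES, under each design.
size_ratio = {
    design: sum(t_test_sizes[design]) / sum(coves_sizes[design])
    for design in coves_sizes
}
for design, ratio in size_ratio.items():
    print(design, round(ratio, 2))
```

Under both designs the t-test needs roughly 2.5 to 2.6 times as many patients in total as the COVES test.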

4.2. More simulation studies. We consider data generated from

(5)    $Z_i = 5 + \gamma C_i + \{1 + \eta I(e_i > 0) I(D_i = 0)\} e_i,$

where $e_i \sim N(0,1)$, and $\eta$ is either 0 (under the null hypothesis) or
1.35 (under the alternative hypothesis). The coefficient $\gamma$ and the
distribution for the covariate $C_i$ will be specified later. Clearly, the
control group (d = 0) has a heavier right tail. When $\eta = 1.35$, the error
variance of the control group (d = 0) is about triple that of the treatment
group (d = 1) under this model. Table 3 summarizes the differences of the two
groups under the alternative hypothesis.
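
Model (5) is easy to simulate, and its first two error moments in the control group have closed forms that can be compared with Table 3. The sketch below is illustrative only; the function names and the seed are assumptions.

```python
import math
import random

def simulate_outcome(c, d, gamma, eta, rng):
    """One draw from model (5): Z = 5 + gamma*C + {1 + eta*I(e > 0)*I(D = 0)} * e."""
    e = rng.gauss(0.0, 1.0)
    inflate = eta if (e > 0 and d == 0) else 0.0
    return 5.0 + gamma * c + (1.0 + inflate) * e

def control_error_moments(eta):
    """Exact mean and variance of {1 + eta * I(e > 0)} * e for e ~ N(0, 1)."""
    mean = eta / math.sqrt(2.0 * math.pi)           # eta * E[e * I(e > 0)]
    second_moment = 0.5 + 0.5 * (1.0 + eta) ** 2    # halves of E[e^2] below and above 0
    return mean, second_moment - mean ** 2

# The treatment-group error is N(0, 1), so the variance below is also the variance ratio.
mean_difference, variance_ratio = control_error_moments(1.35)
z_control = simulate_outcome(2.5, 0, 1.0, 1.35, random.Random(7))
print(round(mean_difference, 2), round(variance_ratio, 2))
```

With $\eta = 1.35$ these formulas give a mean difference of about 0.54 and a variance ratio of about 2.97, matching the last two columns of Table 3.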

We will consider four scenarios for the effects of the covariate in the
analysis:

• Scenario 1, no covariate effect: we take $C_i$ from $N(2.5, 0.5^2)$, with
$\gamma = 0$.

• Scenario 2, a common covariate effect: we take $C_i$ from $N(2.5, 0.5^2)$,
with $\gamma = 1$.

• Scenario 3, a covariate distribution that varies with treatment groups: we
take $C_i$ from $N(2.5, 0.5^2)$ for d = 0, but from $N(3.0, 0.5^2)$ for d = 1,
with $\gamma = 1$.

• Scenario 4, a covariate distribution that has a scale change across
treatment groups: we take $C_i$ from $N(2.5, 0.5^2)$ for d = 0, but from
$N(2.5, 1.0)$ for d = 1, with $\gamma = 1$.

Table 4

Simulation comparisons for the COVES test versus t-test for linear models. The sample
sizes under two conditions m = n and m = 2n are given

            COVES test                                         t-test
            Type I error at     Sample size (m, n) needed      Type I error at     Sample size (m, n) needed
Scenario    (m, n) = (50, 50)   to reach power 0.9             (m, n) = (50, 50)   to reach power 0.9
1           0.046               (51, 51) or (92, 46)           0.050               (140, 140) or (202, 101)
2           0.051               (51, 51) or (92, 46)           0.049               (140, 140) or (202, 101)
3           0.048               (59, 59) or (100, 50)          0.050               (177, 177) or (240, 120)
4           0.053               (50, 50) or (92, 46)           0.052               (140, 140) or (200, 100)


Scenarios 3 and 4 are unlikely for randomized trials, but we include them
in the study to examine the robustness of the COVES test when the covariate
distributions vary to some extent with the treatment groups. The type I
errors of the COVES test and the t-test under these scenarios are controlled
to stay close to the nominal level of 0.05. Table 4 reports the
type I errors at the sample size of m = n = 50. It also reports the sample sizes
needed to reach power of 0.90 in each scenario under two design conditions:
m = n and m = 2n, respectively.

The results clearly show the efficiency of the COVES test. In Scenarios 
2-4, the adjustment of the covariate is important, because the ES test con- 
sidered in Section 4.1 would not be valid, and thus it is not presented in this 
subsection. 

5. A diagnostic tool for COVES. When preliminary or full data are 
available, it is often helpful to have a simple diagnostic tool that points to a 
case in favor of the COVES test. We suggest examining the quantile function 
plot, as used in Figure 1, but applied to the covariate-adjusted outcomes 
defined in Section 3. When the quantiles of covariate-adjusted outcomes 
from different treatment groups differ mostly in one tail, we have a clear case 
in favor of the COVES test or a similar test that focuses on the tail. In fact, 
the plot can also suggest an appropriate level of $\tau$ to be used for COVES.
To illustrate this point, we simulated one data set of size m = n = 60 from
Scenario 3 in Section 4.2 with $\eta = 1.35$ in model (5). Unsure about a good
choice of $\tau$, we considered using the covariate-adjusted outcomes from three
quantile levels 0.5, 0.75, and 0.9, and examined the resulting quantile plots
in Figure 4. No matter which quantile level we started with, the quantile
plots of the covariate-adjusted outcomes look similar, and they all suggest
that the COVES test with $\tau$ around 0.75 would be a good choice. On the
other hand, if the quantile functions of different treatment groups show a
vertical shift, we would then favor the t-test to the COVES test.
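
The diagnostic can also be read off numerically instead of from a plot. The sketch below tabulates the gap between the two groups' empirical quantile functions on a grid of levels; the covariate-adjusted outcomes are assumed to have been computed already, and the function names and toy values are hypothetical.

```python
import math

def quantile_function(values, levels):
    """Empirical quantiles (order statistics) of `values` at the given levels."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[max(1, math.ceil(level * n)) - 1] for level in levels]

def quantile_gap(adjusted_treatment, adjusted_control, levels):
    """Control minus treatment quantiles of the covariate-adjusted outcomes."""
    q_treatment = quantile_function(adjusted_treatment, levels)
    q_control = quantile_function(adjusted_control, levels)
    return [qc - qt for qt, qc in zip(q_treatment, q_control)]

# Toy covariate-adjusted outcomes that differ only in the upper tail.
adjusted_treatment = [0, 0, 0, 1, 1, 2, 3, 4]
adjusted_control = [0, 0, 0, 1, 1, 2, 9, 14]
levels = [0.25, 0.5, 0.75, 0.85, 0.95]
gaps = quantile_gap(adjusted_treatment, adjusted_control, levels)
print(dict(zip(levels, gaps)))
```

A gap that stays near zero up to some level and opens up beyond it points to a tail-focused test with the quantile level set near that point, whereas a roughly constant gap across all levels corresponds to the vertical shift that favors the t-test.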

Fig. 4. Quantile function plots of the covariate-adjusted outcomes; the adjustments are
made based on regression quantile at (a) $\tau = 0.5$, (b) $\tau = 0.75$, (c) $\tau = 0.9$. The diagnostic
plots are insensitive to the initial choice of $\tau$.

6. Conclusions. The proposed COVES test aims to detect treatment ef- 
fects that are reflected mostly in the upper (or lower) tail of the outcome 
distributions. The test is powered up by the use of the expected shortfall as a 
natural differentiating quantity in such applications. We find that the regres- 
sion quantile methodology is appropriate and convenient for computing the 
covariate-adjusted expected shortfall in the test. Our study on the change of 
the Total Sharp Scores due to different treatments on rheumatoid arthritis 
shows that a substantial sample size reduction over the conventional t-test
based on linear models can be achieved. 

In this paper, we used $\tau = 0.75$ in the proposed COVES test, because
it serves two purposes in the application. First, earlier studies have shown
that conventional rheumatoid arthritis treatments are effective for nearly 75%
of the patient population, so it is less meaningful to detect differences below
the 75th percentile. Second, a more effective treatment should work well for
a substantial portion of the patients, so if we set $\tau$ too high in the
COVES test, a significant difference in the upper tail might be difficult to
detect statistically. Finally, we note that the development of the COVES
test in this paper was made in response to the randomized clinical studies
on rheumatoid arthritis treatments, but the basic idea and the methodology
clearly generalize to other problems where tail differences, possibly at other
$\tau$ values, are of interest. In general, we suggest using quantile function
plots on covariate-adjusted outcomes as a simple diagnostic tool for suggesting
a good choice of $\tau$.

APPENDIX: SKETCH OF PROOF 

The following lemma follows directly from the consistency and the Ba- 
hadur representation of regression quantile estimators; see Koenker [(2005), 
Section 4.3] and He and Shao (1996). 

Lemma 1. If $\{(Z_i, D_i, C_i)\}$ is a random sample satisfying (1),
$\lim_{m,n \to \infty} (m + n)^{-1} U_f$ exists, $E|C_i|^3 < \infty$, and $f_i$
are uniformly bounded away from 0 and infinity, then we have the Bahadur
representation on $\hat\gamma$

       $\hat\gamma - \gamma(\tau) = -U_f^{-1} \sum_i C_i^* I(e_i < 0) + o_p((m + n)^{-1/2})$

and the representation on $\bar e_\tau(d)$

       $\bar e_\tau(d) - \Bigl\{ \sum_{D_i = d} I(e_i > 0) \Bigr\}^{-1} \sum_{D_i = d} e_i I(e_i > 0) = o_p((m + n)^{-1/2}),$

where $U_f = \sum_i (f_i C_i^{*2})$, $f_i$ is the conditional density function
of $e_i$ given $(D_i, C_i)$ evaluated at 0, and
$C_i^* = C_i - N_d^{-1} \sum_j C_j I(D_j = d)$.

Proof of Theorem 3.1. By replacing $\hat e_i$ in $T_\tau(m,n)$ by $e_i$ and
using the results in Lemma 1, we approximate $T_\tau(m,n)$ by

       $T^*(m,n) = \delta(\tau) + \Bigl[ \{(1 - \tau)m\}^{-1} \sum_{D_i = 1} I(e_i > 0) e_i - (\bar C_\tau(1) - \bar C_\tau(0)) U_f^{-1} \sum_{D_i = 1} C_i^* I(e_i > 0) \Bigr]$
       $\qquad - \Bigl[ \{(1 - \tau)n\}^{-1} \sum_{D_i = 0} I(e_i > 0) e_i + (\bar C_\tau(1) - \bar C_\tau(0)) U_f^{-1} \sum_{D_i = 0} C_i^* I(e_i > 0) \Bigr].$

It is clear that $E(T^*(m,n)) = \delta(\tau) = 0$ under $H_0$, and $T^*(m,n)$
is asymptotically normal, with

       $\mathrm{var}(T^*(m,n)) = \{(1 - \tau)m\}^{-2} \sum_{D_i = 1} \bigl( E\{e_i^2 I(e_i > 0)\} - [E\{e_i I(e_i > 0)\}]^2 \bigr)$
       $\qquad + \tau(1 - \tau)(\bar C_\tau(1) - \bar C_\tau(0))^2 U_f^{-2} \sum_i (C_i^*)^2$
       $\qquad + \{(1 - \tau)n\}^{-2} \sum_{D_i = 0} \bigl( E\{e_i^2 I(e_i > 0)\} - [E\{e_i I(e_i > 0)\}]^2 \bigr).$

Again, by Lemma 1 and $T_\tau(m,n) - T^*(m,n) = o_p((m + n)^{-1/2})$, the
asymptotic normality of Theorem 3.1 follows. □